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Fully nonparametric methods for regression from functional data 
have poor accuracy from a statistical viewpoint, reflecting the fact 
that their convergence rates are slower than nonparametric rates for 
the estimation of high-dimensional functions. This difficulty has led 
to an emphasis on the so-called functional linear model, which is much 
more flexible than common linear models in finite dimension, but nev- 
ertheless imposes structural constraints on the relationship between 
predictors and responses. Recent advances have extended the linear 
approach by using it in conjunction with link functions, and by con- 
sidering multiple indices, but the flexibility of this technique is still 
limited. For example, the link may be modeled parametrically or on a 
grid only, or may be constrained by an assumption such as monotonic- 
ity; multiple indices have been modeled by making finite-dimensional 
assumptions. In this paper we introduce a new technique for estimat- 
ing the link function nonparametrically, and we suggest an approach 
to multi-index modeling using adaptively defined linear projections 
of functional data. We show that our methods enable prediction with 
polynomial convergence rates. The finite sample performance of our 
methods is studied in simulations, and is illustrated by an application 
to a functional regression problem. 

1. Introduction. When explanatory variables are functions, rather than 
vectors, the problems of nonparametric regression and prediction are intrin- 
sically difficult from a statistical viewpoint. In particular, convergence rates 
can be slower than the inverse of any polynomial in sample size, and so 
relatively large samples may be needed in order to ensure adequate perfor- 
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mance. Fully nonparametric methods have been studied recently in func- 
tional data regression and related problems (see, e.g., [11, 12] and [8]). The 
slow convergence rates associated with these unstructured nonparametric 
approaches provide motivation for seeking nonparametric approaches that 
exploit a greater amount of structure in the data and are therefore expected 
to have better properties from a statistical perspective. 

Advances in this direction include those made in [1, 9, 10, 14] and [17], 
where both parametric and nonparametric link functions were introduced in 
order to connect the response to a linear functional model in the explana- 
tory variables. However, the flexibility of available link-function models is 
still rather limited. For example, although nonparametric link functions were 
considered in [17], the approaches considered there are restricted by the as- 
sumption of monotonicity, where the corresponding "Generalized Functional 
Linear Model" approach is based on a semiparametric quasi-likelihood based 
estimating equation, which includes known or unknown link and variance 
functions. In contrast, we are aiming here at models with one or several 
nonparametric link functions, ignoring possible heteroscedasticity of the er- 
rors. Our approach provides an alternative to the related methods in [2], 
where single-index functional regression models with general nonparamet- 
ric link functions are considered that may be chosen nonmonotonically and 
without shape constraints. The main differences are that our methodology 
includes the multi-index case, does not anchor the true parameter on a pre- 
specified sieve, and that we provide a detailed theoretical analysis of a direct 
kernel-based estimation scheme that culminates in a convergence result that 
establishes a polynomial rate of convergence. 

Beyond demonstrating that our approach enables prediction with polyno- 
mial accuracy, we also include generalizations to iteratively fitted multiple 
index models, founded on a sequence of linear regressions. Here we borrow 
ideas from dimension reduction in models that involve high-dimensional, but 
not functional, data. When the link function is nonparametric, the intercept 
term in functional linear regression loses its relevance because it is incorpo- 
rated into the link. The slope function is still potentially of interest, but the 
viewpoint taken in this paper is predominately one of prediction rather than 
slope estimation, and in particular our theory focuses directly on the predic- 
tion problem. We refer to the papers by [4, 5] and [7] for further discussion 
of these objectives in the context of the functional linear model. 

We introduce our model and estimation methodology in Section 2. The- 
oretical results regarding the polynomial convergence rate are discussed in 
Section 3, while algorithmic details are described in Section 4, which also 
includes an illustration of the proposed methods with an application to spec- 
tral data. Simulation results are reported in Section 5. Detailed assumptions 
and proofs can be found in the Appendix. 
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2. Model and methodology. 

2.1. Model. Suppose we observe data pairs (X±, Y\), . . . , (X n , Y n ), inde- 
pendent and identically distributed as (X, Y), where X is a random function 
defined on a compact interval I and Y is a scalar. We anticipate that (X, Y) 
is generated as 



where g is a smooth functional and the error has zero mean, finite variance 
and is uncorrelated with X. The model at (2.1) admits many interpreta- 
tions and generalizations, where, for instance, X is a multivariate rather 
than univariate function. For example, X might be (Z,Z'), where Z is a 
univariate function and Z' its derivative. To simplify the developments, we 
shall focus on problems where X is a univariate function of a single variable. 
Models and methodology in more general settings are readily developed from 
the single-variable case. Our approach is described in detail for situations 
where the trajectories of functional predictors can be assumed to be fully 
observed, for example, due to smoothness such as for the Tecator data which 
we analyze with the proposed methods in Section 4.2; it can be extended to 
cases with densely and regularly measured trajectories, where measurements 
may be subject to i.i.d. noise with finite fourth-order moments. This exten- 
sion requires sufficiently dense measurement designs, such that smoothness 
assumptions coupled with suitable smoothing methods lead to sufficiently 
fast uniform rates of convergence when pre-smoothing the data to generate 
smooth trajectories. Such an extension will not be feasible for functional 
data for which only sparse and noisy measurements are available. 

The case where g, in (2.1), is a general functional, even a very smooth func- 
tional, can have serious drawbacks from the viewpoint of practical function 
estimation, since the problem of estimating such a g is inherently difficult 
from a statistical viewpoint. In particular, convergence rates of estimators 
in this case are generally slower than the inverse of any polynomial in sam- 
ple size. Therefore, unless the data set is very large, it can be particularly 
difficult to estimate g effectively. In this respect the commonly assumed func- 
tional linear model, where g(x) = a + J x /3x, a is a scalar and j3 is a regression 
parameter function, offers substantial advantages, for example, polynomial 
convergence rates and even, on occasion, root-n consistency. However, the 
linear- model assumption is often too restrictive in practical applications. 

An alternative approach is to place the linear model inside a link function, 
for example, defining 



V Jx J 

although this, too, is restrictive unless we select the link in a very adaptive 
manner. We suggest choosing the link function g\ nonparametrically. In this 



(2.1) 



Y = g(X) + error, 



(2.2) 




4 



D. CHEN, P. HALL AND H.-G. MULLER 



case the intercept parameter, a, in (2.2) is superfluous; it can be replaced by 
zero, and its effect incorporated into g\. Therefore we actually fit the model 

(2.3) g(x)=g 1 (J^x 

where g\ is subject only to smoothness conditions, and to ensure identifiabil- 
ity, we require a condition on the "scale" of (3, which we choose as f x (3 2 = 1. 
The sign of (3 can be determined arbitrarily. 

2.2. Methodology. We estimate the parameter function /3 and the link 
function g\ in the model at (2.3), using least squares in conjunction with 
local-constant or local-linear smoothing as follows. To obtain g\, we will 
use a scatterplot smoother which we implement as local-constant or local- 
linear weighted least squares smoothing. Given a parameter function /?, the 
scatterplot targeting the nonparametric regression gi(z) = E(Y\ J x f3X = z) 
consists of the data pairs (f x ^i)i=i,...,n- Omitting the predictor Xj when 
predicting the response at f x (3Xj, averaging least squares smoothers con- 
structed for predicting at the observed predictor levels Xj are then obtained 
by choosing intercept parameters Q and slope parameters i?j to minimize 



^^-(Wr 1 /^-^)) or 



(2.4) 

EE 

i,j ■ 

in the local-constant and local-linear cases, respectively, where K is a kernel 
function and h is a bandwidth. 

Defining Ky = K {h' 1 ^{X, - X,-)}, Xj = (V : , , ; A\ ) A\; 
and Yj = i^2 i . i ^jY i Kij)/'}2 i . i _ /z -Kij, one finds that the minimia of (2.4), 
for any given /3, are 

i,j ■ i+i i,j ■ tyj X ) 

The minimizers Q are given by Q = Cj(P) = Yj m the local-constant case 
and in the local-linear case minimizers Q and $j are given by 



(2 6) 

T.^ikKXi-m 2 ^ ' 1 - J - n - 

Summarizing, the criteria at (2.4) are based on averaging local-constant and 
local-linear fits to gi(f x fix) at x = Xj, averaging over Xj, where the respec- 
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tive fits are computed from the data X\ , X n , excluding Xj . The resulting 
approximations to gi(f x f3Xj), for a given (3, are Yj and Yj + $j(/3) f x /3(Xj — 
Xj), respectively, with i}j(/3) given by (2.6). 

It remains to specify our final estimates. We estimate f3 by conventional 
least squares, aiming to minimize the sum of squared differences between Yj 
and the just-mentioned approximations: 

n 

S{P) = Y J {Y ] -Y 3 m 2 or 

3=1 

S({3) = f^Yj - Yj(P) - Vj(!3)Jj(Xj - Xj) 

subject to J x f3 2 = 1 and with $j((3) as in (2.6). This problem is most con- 
veniently solved by expanding (3 = Yli<k<r bki^k, where ipi,ip2, ... is an or- 
thonormal basis and r denotes a frequency cut-off, choosing the generalized 
Fourier coefficients b^ to minimize S(f3). This gives estimators b\,...,b r of 
b±, . . . , b r , respectively, and from those we may compute our estimator of f3: 

r r 

(2.8) p = ^k subject to Y % = 1. 

k=l k=l 

The basis ipi, "02 ? ■ • ■ can be chosen as a fixed basis such as one of various 
orthonormal polynomial systems or the Fourier basis, or could be another 
sequence altogether, chosen for computational convenience. We note that it 
does not matter for the validity of our results whether the basis functions 
are fixed or random. Therefore the basis can be chosen as estimated eigen- 
function basis, as long as the estimated eigenfunctions are orthonormal. We 
note that irrespective of how it is constructed, the selected basis needs to 
be such that condition (3.4) below is satisfied for the generalized Fourier co- 
efficients of /3, while the additionally needed conditions (3.5), (3.6) depend 
only on properties of j3 and X but not on the choice of the basis. The con- 
dition at (3.4) requires a polynomial decay rate (of arbitrary order) for the 
tail sums of the Fourier coefficients of /?, which is slightly stronger than the 
convergence of the tail sums to that is implied by the square integrability 
of /3. Since we do not assume prior knowledge about /3, no particular basis 
is preferable in this regard a priori. In any case, the theory applies if (3.4) 
holds for the selected basis. In practice, one would choose a basis based on 
how well the representation of (3 works in typical applications. We found the 
choice of estimated eigenfunctions for representing (3 particularly convenient 
for our applications and our implementation is therefore using this basis. 

We note that the criteria at (2.4) are not directly comparable with those 
at (2.7), not least because in (2.4) we are fitting g\ locally and in (2.7) we 
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are fitting f3 globally. Reflecting these two different contexts, each residual 
squared error in both criteria in (2.4) has a local kernel weight, whereas each 
residual squared error in the criteria in (2.7) has a constant weight. 

Having computed /3, we estimate the univariate function g\{u) by con- 
ventional local-constant or local- linear regression on the pairs (f x (3Xi,Yi), 
for 1 < i < n. In particular, in the local-constant case we take 



(2.9) 9l (u) = ]TY^(u) / \Y,Ki{u) 




where Ki(u) = K{h /3Xi — u)}\ in the local-linear setting we choose £ = 
£ and $ = '&, both of which can also be viewed as functions of u, to minimize 
YlA^i ~ (C + ^ fz PXi)} 2 Ki(u), and then put gi(u) = C + #u. Several aspects 
of this algorithm can be modified to improve its performance. For example, 
noting that the ratio on the right-hand side of (2.6) will likely be unstable 
if the denominator is based on a relatively small number of terms, we might 
restrict the sum over j in either formula in (2.5) to values of that index for 
which ijLj Kij — where A > denotes a sufficiently large threshold, 
and repeat this restriction in the case of (2.7). Problems caused by a too- 
small denominator can be especially serious in the case of functional data, 
since sample sizes there are typically relatively small. 

If we take the view that the problem of interest is that of estimating g for 
the purpose of prediction, and that estimating (3 and g\ in their own right 
is of relatively minor interest, then standard cross-validation can be used 
to choose simultaneously the smoothing parameters h and r. In Section 3 
we adopt the perspective of prediction, and show that in that context the 
estimator g approximates g at a rate that is polynomially fast as a function 
of sample size. 

2.3. Multiple index models. The model at (2.3) can be generalized by 
taking g\ to be a p-variate function 

(2.10) g{x)=g i np 1 x,...,£p p x\ = 1 for 1 < j < p. 

However, given the relatively small sample sizes often encountered in func- 
tional data analysis, focusing on the function at (2.10), with p > 2, will 
often lead to estimators with high variability. An alternative, p-component 
functional multiple index model, such as 

(2.11) g(?) = gxU^Pix) + --- + gpU^/3px\ 

is arguably more attractive. This class of models has been considered by [15], 
who referred to them as "Functional Adaptive Models." The approach of 
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James and Silverman was restricted to the parametric case by requiring the 
functional predictors xi as well as the link functions gj to be elements of a 
finite-dimensional spline space, excluding nonparametric (infinite-dimensional) 
link and predictor functions. Such a fully parametric framework allows the 
use of a likelihood-based approach to fitting these models, establishing iden- 
tifiability by extending previous results for the vector case [6]. 

Since our main goal is prediction and not model identification, we are not 
primarily concerned with identifiability issues and do not emphasize specific 
identifiability conditions for the models we consider. The models at (2.10) 
and (2.11) in fact are not identifiable without further restrictions. To appre- 
ciate why, note that the order of the components on the right-hand side of 
(2.10), or of the functions on the right-hand side of (2.11), could be permuted 
without affecting the model. This problem does not arise for conventional 
multivariate or additive models, where the arguments of the functions are 
predetermined as the components of the explanatory variable x. While this 
difficulty can be overcome in a variety of ways, using a recursive additive 
model is attractive on both statistical and computational grounds. We now 
give background for that approach. 

It is not uncommon in statistics to pragmatically alter a difficult prob- 
lem to one that is simpler. Indeed, the introduction of additive models is 
typically motivated in that manner. Thus, we could generalize the prob- 
lem of estimating a link function g, and a slope function /3, in (2.1), subject 
only to smoothness conditions, to that of estimating the intrinsically simpler 
functions defined at (2.11). Alternatively, and more appropriately from the 
perspective of general inference, we would seek to estimate g in (2.10) not 
because we felt that those functions were identical to g in (2.1), but because 
they were relatively accessible approximations to g. Taking this view of the 
problem of estimating, or rather, approximating, the function g in (2.1), and 
accepting that the p-additive function at (2.11) is more likely to be practi- 
cable in functional data analysis than the p-variate function at (2.10), we 
suggest fitting the g in (2.11) recursively, for steadily increasing values of p. 
This "backfitting" approach borrows an idea from projection pursuit regres- 
sion, to use recursive, low-dimensional, projection-based approximations. 

In particular, taking g® = g° where g° denotes the true value of g at 
(2.1), we choose the function g\ of a single variable, and the function to 
minimize, in the case j = 1, the expected value 



More generally, if we have calculated and Oj-i, and previously defined 
g'j-i(x), then we may define by g*](x) = S t j °_ 1 (x) - ffj-i^^-iX) and 
choose gj and f3j to minimize the quantity at (2.12). 





8 



D. CHEN, P. HALL AND H.-G. MULLER 



In the next section we shall show how to calculate estimators gj and (3j of 
gj and f3j, respectively, for j > 1. Note that we do not claim to consistently 
estimate g, in (2.1), unless that function has exactly the form at (2.3) (in 
which case our estimator is g = g±, defined in Section 2.2). Instead we suggest 
developing consistent estimators of successive approximations to g(x), that 
is, of 



(2.13) 




2.4. Estimation in functional multiple index models. Here we generalize 
the methodology in Section 2.2 so that it permits estimation of the functions 
gi,g2,-.. in (2.12). Assume we are fitting a p-index model. The recursive 
fitting procedure means once we have estimators f}~ and gj, for 1 < j < 
k — 1 < p, of the functions (3j and gj defined in the paragraph containing 

(2.12), we take Yi{k) = Y{ — gi(Xj) — • and use the methodology 

in Section 2.2 but with Yi(k) replacing Yj, obtaining an estimator /3, on this 
occasion actually an estimator of and an estimator g, which is really an 

estimator of g^. The quantity gi(f T j3ix) H h g P {J x Pp x ) ls our estimate 

of the p-index model from the recursive fitting procedure. 

A further refinement that leads to smaller prediction errors is backfit- 
ting, which uses the recursive fits described above as a starting point. Once 
these fits are obtained, further updates are obtained iteratively by revisiting 
and updating one index after another, presuming that the remaining p—1 
indexes are fixed. The iterative updating of individual indices is itself iter- 
ated until indices change only little. This is implemented in a similar way 
as described in [6] for a traditional multiple index model with monotone 
link functions. Denoting the estimates obtained from the initial recursive 

fitting procedure by gi(J x $ix) H h gp(J x Pp x ) > then for the dth iteration, 

iterating also through the increasing sequence k = 1,2, . . . ,p, one uses 




to replace Y\ for fitting gf,(f x $fx). The iterative backfitting procedure is 
stopped once the relative differences between and fif fall below a pre- 
specified threshold or a maximum number of iterations is reached. 

3. Polynomial convergence rate. The main result in this section estab- 
lishes that, if the linear model is linked to the response variable as in (2.3), 
if a Holder smoothness condition on the link function g\ is assumed, and if 
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we ask of the generalized Fourier expansion f3 = Y2k>i ^fcV'fc that it converges 
polynomially fast at a sufficiently rapid rate, then the predictor g converges 
to (j at a polynomial rate. That property distinguishes the approach sug- 
gested in this paper from fully nonparametric methods that impose only 
smoothness conditions on the function g, in (2.1), but have much slower 
convergence rates for the predictor. We give explicit theory only in the local- 
constant case, since, as argued at the end of Section 2.2, that approach is 
particularly appropriate when dealing with functional data. The local-linear 
setting can be treated similarly. 

We assume that independent and identically distributed data pairs (Xj, Yi) 
are generated by the model discussed in Section 2: 

Yi = g{Xi) + Si, where the Xi's are square-integrable random 
functions supported on the compact interval I, g is a real- 
valued functional given by g(x) = gi(f x f3°x), g\ is a real- valued 
(3.1) function of a single variable, f3° enjoys the property J x f3° 2 = 1 
and denotes the true value of the square-integrable function 
/3, and the errors are independent of the X^s and of one 
another, and have zero mean. 

The only assumption we make of g\ is that it is bounded and smooth: 

, 2) ^ * S bounded an d satisfies a Holder continuity condition: 
\gi(u) — gi(v)\ < D\\u — v\ ai for all u and v, where a±, D\ > 0. 

The assumption that g\ is bounded can be relaxed. For example, if the 
functions Xi are bounded with probability 1, then f x {3°Xi is uniformly 
bounded, and so the distribution of the response variables Yi depends only 
on the values that g\ takes on a particular compact interval. We can extend 
gi from that interval to the whole real line in such a way that the extended 
version of g\ is bounded and has a bounded derivative. More generally, if 
sup 1< j <n ||Xj|| grows at rate 0(n v ), for all r\ > 0, where ||X|| denotes the 
L2 norm of X (e.g., this condition holds if X is a Gaussian process), and 
if sup|3.|< u |<7i(x)| grows at no faster than a polynomial rate as u diverges, 
then only minor modifications of our proof of the theorem are required to 
establish Theorem 3.1. 

Let X have the common distribution of the random functions in the 
model at (3.1). We ask that ||X|| have at least a small, fractional moment, 
and that all moments of the error distribution be finite. In particular: 

< 00 for some rj > 0, and E(\e\ m ) < (D 2 m) a2m for all 
^ ' ' integers m > 1, where 02,-^2 denote positive constants. 

The condition 2£|e| m < (D2in) a2m is satisfied by distributions the tails of 
which decrease at rate at least exp(— C\x C2 ), for constants C\,Ci > 0, pro- 
vided we choose 02 > I/C2. In particular, the condition is satisfied by expo- 
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nential and Gaussian distributions, and also, in the case C*2 < 1, by many 
distributions that do not have finite moment generating functions. 

Write /(• | /3) for the probability density of j x /3X. Given an orthonormal 
basis tpi , ip2 > • • • f° r the class L2 (X) of square-integrable functions on I, ex- 
press a general function f) E ^(I) with Jj/3 2 = 1 as j3 = ^fcM^fcV'fe) where 
Sfc>i frfc = 1- ^ or constants as, a^, B , D%, D4, D§ > 0, we shall assume that: 



00 

(3.4) ^ b\ < D 3 (l + r)~ B for all r > 1, 

fc=r+l 



(3.5) sup /(x I /3) < 00, 

(3.6) supp|/^y <D 4 «5 a3 for all |u| < 5 j <D 5 5 a \ 

where (3.6) holds for all sufficiently small 5 > 0. Condition (3.4) is standard; 
it asks that the generalized Fourier coefficients of /3 decay at least polynomi- 
ally fast, in a weak sense. To appreciate the motivation for (3.5) and (3.6), 
observe that if X is a Gaussian process for which the covariance operator 
has eigenvalues 9\ > 62 > • • • > and respective eigenfunctions fa , fa, ■ ■ ■ , 
then /(• I P) is the N(a,? 2 ) density, where a = f x /3E(X), ? 2 = ^ k>1 9 k b 2 k 
and b k = f x {3(j) k . Then (3.6) is obtained by using well-known tail bounds for 
the Gaussian distribution function $ with standard Gaussian density cf). It 
follows that (3.5) and (3.6) hold whenever < 04 < 03 < 00 and B is a class 
of functions f3 for which X^fe>i @kb k ^ s bounded away from zero and infinity, 
and for which Y2k>i ^% = ^- ^ ur use °^ ^ n e principal component basis in this 
example serves only to show the reasonableness of conditions (3.5) and (3.6), 
which of course do not depend on choice of basis. It does not imply that the 
basis ipi , ip2 , ■ ■ ■ should be identical to fa , fa , ■ • ■ • 
Of the kernel K and bandwidth h we ask that: 



K is nonnegative and symmetric, has support equal to a com- 
pact interval, decreases to zero as a polynomial at the ends of 
its support, and has a bounded derivative; and h ~ D§n~ G as 
n — > 00, where C, Dq > 0. 



Define /3 to be the minimizer of S(f3) = ^2j(Yj — Yj) 2 [the first quan- 
tity in (2.7), corresponding to local-constant estimation] over functions (3 = 
Y.i<k<r kfcV'fc) constrained by ^2i<k<r^k = 1> f° r wn ich (3.5) and (3.6) hold 
and sup fc>1 \b k \ < Dj, with D-j > D3 [the latter as in (3.4)], and where 

(3.8) r denotes the integer part of Dgn D , for constants D, Dg > 0. 

This is the procedure for constructing g suggested in the argument leading 
to (2.8), in the local-constant case. 
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Theorem 3.1. If (3.1)~(3.8) hold, if B in (3.4) is sufficiently large, 
and if C and D in (3.7) and (3.8) are sufficiently small (all three constants 
depending only on a\, ... ,0,4,), then there exists a constant c>0 such that, 
as n — > 00, 

n 

(3.9) n- 1 J>(X» - g(X 3 )Y = O p (n- c ). 

i=i 

The proof is in the Appendix. It is possible to extend Theorem 3.1 to 
the recursive additive model case formulated in Section 2.4, although the 
argument there is significantly longer. As explained earlier, for the case of 
Gaussian predictors X, the choices a% = 1, = 1, 03 = 1, 04 = 1 are possible 
and then by choosing the other constants judiciously, observing the various 
constraints, one finds that one may obtain the rate of convergence in (3.9) for 
c with c < 1/4. We do not pursue here the question of the optimality of this 
rate of convergence. An assumption that has been made throughout is that 
the predictor trajectories are fully observed. This is an idealized situation. 
It is possible to weaken this assumption, assuming that the trajectories are 
sampled on a dense grid of points so that integrals such as those appearing 
in (2.12) can be closely approximated. 

4. Algorithmic implementation and data illustration. 

4.1. Description of the algorithm. 

Step 1. Estimating f3. We assumed that h, r and the basis {ipi, ■ ■ ■ , Vv} 
(we used eigenbasis in our implementation) in (2.8) and (2.9) were given. We 
set ft = Ylk=i ^fcV'fc) an d the coefficients 61, . . . , b r were estimated by minimiz- 
ing (2.7). Those Yj with ^ . i , ■ Kij < A were dropped from the minimization 
(we chose A = 0.1). Letting ^ = f ipkXi and writing S((3) in (2.7) in terms 
of bi,...,b r , 

, n , x2 
(4.1) 5(6i , . . . , br) = - ( Yj - VijY ) 

for the local-constant case, where 



Wij{bi,...,b r ,h) 



are the terms related to b±,...,b r . For the local-linear case, S(b\, . . . , b r ) is 
more complicated, with similar subsequent steps. 

We note that (b±, . . . ,b r ,h) are not identifiable without constraints, since 
Wij(b\, . . . , b r , h) = Wij(cb\, . . . , cb r ,ch) for any constant c. Meanwhile, if K 
is symmetric, Wij(b\, . . . , b r , h) = Wij(—b\, . . . , —b r ,h). There are at least two 
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ways to ensure algorithmic identifiability. In a first approach, given h, one 
may find (&x, • ■ • , K) by minimizing (4.1), subject to the constraints Ylk=i ^1 = 
1 and b\ > (or > for some b^ ^ if b\ = 0). A second option is to find 
6 r ) that minimizes (4.1) near a suitable starting point (ci,...,c r ), 
satisfying X)I=i C I = 1 an< ^ c i > ^, and then to rescale the solution to ( , J^ 1 , 

. . . , , ^ r , ^ ). The second option is simpler since the unconstrained 

V Efe b fe v Efe fe fe 

minimization is easier to achieve. However, if one wishes to specify h, the 
constraint X^I=i ^1 = 1 nee ds to be enforced in the minimization step. In the 
simulations, we found that both options led to virtually the same solution 
for a well-chosen bandwidth h. 

The minimization step is a nonlinear least squares problem, which can be 
implemented through the optimization package in MATLAB. It is important 
to secure a good starting point for the minimization. We obtained a default 
starting point by searching along each dimension separately. Starting with 
the first dimension, we located a minimum along S(pi), as defined in (4.1), 
along a grid of values of b\ in the interval [0, 1] . After obtaining the minimizer 
xi, we continued to search along the second dimension using values S(x\, 62), 
where 62 varies on a grid within [—1,1]. This approach was then iterated as 
necessary and provided the starting point. 

Step 2. Selecting r and h. Here r is the number of eigenfunctions used in 
(2.8) and h is the kernel bandwidth. We employed 10-fold cross-validation 
to evaluate each pair (h,r). Each of 10 subgroups of curves denoted by 
Vi,...,Vxo was used as a validation set, one at a time, while the remain- 
ing data were used as the training set. For given (h,r), we found (3 as de- 
scribed in step 1 and computed S(r, h) = ^ ]#v k Sfc=i where 5&(r, h) = 

SjeV fc 0j — j^ 2 an d = 9i(f $Xj)> using local-constant or local-linear 
method as described in the paragraph containing (2.9) and assuming only 
Yi in the training set are known. We then found the minimizers of S(r,h), 
which were the selected values for r and h. 

Step 3. Backfitting step. By default, we fitted a single-index functional 
regression model, which meant that predictions g(J (3xi) were obtained via 
(2.5) using the optimal (h,r) chosen in step 2 and the corresponding es- 
timated (3 in step 1. For fitting a p-index functional regression model, the 
fits obtained in an initial single-index step gave only g^{J in (2.13). 

We then replaced Yi by Yi — g®(J p(xi) and repeated steps 1 and 2 to find 

92 (I @2 x i)- This procedure was iterated until p indices were obtained. This 
only gives us the initial estimate of the p-index model. Then for the dth 
iteration and the increasing sequence k = 1, . . . ,p, we used Y^k) defined in 
(2.14) to fit gf(f j%x). The iteration stops once ||/3f _1 - ^f\\ L2 < 0.01 or 10 
iterations are reached. 
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Fig. 1. Sample of 204 absorbance spectra for meat specimens. 

4.2. Illustration for spectrometric data. We applied the proposed model 
to spectrometric data that can be found at http://lib.stat.cmu.edu/ 
datasets/tecator. We used only part of the data with data selection per- 
formed in the same way as in [3] and [12]. These data were obtained for 215 
pieces of meat, for each of which one observes a spectrometric curve Xj, cor- 
responding to an absorbance spectrum measured at 100 wavelengths. These 
spectrometric curves are depicted in Figure 1. The fat content of each sam- 
ple was determined by an analytic method and recorded as a scalar response 
Yi. One is interested in predicting the fat content of each sample directly 
from the spectrometric curve. 

In a preprocessing step, we removed 11 outliers. We also normalized each 
spectrometric curve by subtracting its area under the curve, j Xi(t)dt, be- 
cause we found that the first eigenfunction of the spectral curves is almost 
flat and its eigenvalue is much larger than the others, but the corresponding 
fitted coefficient b\ in (2.8) is close to 0. This normalization step reduced 
the leave-one-curve-out prediction error by more than 30%. The first four 
estimated eigenfunctions for the normalized curves are plotted in Figure 2. 

To fit the functional single-index model, we used 10-fold cross-validation 
to choose the number r of included eigenfunctions in the representation (2.8) 
and the bandwidth for the Epanechnikov kernel, obtaining 4 and 0.0687 for 
these choices. Using the local-linear method described in (2.5) and (2.7), we 
then estimated the regression parameter function /3\ and the link function g\ . 



14 



D. CHEN, P. HALL AND H.-G. MULLER 



eigenfunction 1 
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0.1 
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eigenfunction 3 





900 950 1000 
eigenfunction 4 




Fig. 2. The first four estimated eigenfunctions of the normalized absorbance spectra. 

These function estimates are shown in the upper panel of Figure 3. The 
average leave-one-curve-out squared prediction error for the proposed single- 
index model is 3.51, while fitting a Generalized Functional Linear Model 
(GFLM) led to a prediction error of 4.99, showing substantial improvement 
for the proposed model. 

We further applied the backfitting procedure described in Section 2.4 
to check whether a multiple index functional model is more appropriate 
for these data than a single-index model. The average leave-one-curve-out 
squared prediction errors were found to be 2.39 for the model with two 



regression function 1 



link function 1 




900 950 1000 1050 
regression parameter function 2 





0.2 0.4 
link function 2 




Fig. 3. The estimated regression parameter functions and link functions. Left two panels: 
the estimated regression parameter functions j3\ and P2 for the first and second index, 
respectively; right two panels: the estimated link functions gi and §2 for the first and 
second index, respectively. 
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indices and 2.42 for three indices. The estimated regression parameter func- 
tions fa and link function c/2 are also displayed in Figure 3. The plot of fa 
suggests that the small bump around wavelength 930 is an important indi- 
cator of the fat content level. We note that fa has similar shape as fa except 
for differences around wavelength 975, where it is positive. The model with 
two indices emerges as the best choice for prediction and improves more 
than 50% upon the GFLM and more than 30% upon the single-index model 
in terms of prediction error. 

5. Simulation study. 

5.1. Simulations for single-index models. We studied the finite sample 
performance of five single-index models (2.3). Samples of balanced func- 
tional data consisting of N = 50/200/800 predictor trajectories and a scalar 
response were generated and each predictor function was sampled through 
50 equidistantly spaced measurements in [0, 1]. The predictor functions were 
generated as 

4 

Xi(t) = n(t) + Y,Zik<t>k(t), i = l,...,N, 
k=i 

where fi(t) =t, fa{t) = ^sin(2vrt), <j) 2 {t) = ^cos(2vrt), </> 3 (t) = ^sin(47rt), 
<M*) = -^|Cos(47rf), and £ ik are i.i.d. iV(0, \ k ) with Ai = 1, A 2 = \, A 3 = \, 
A4 = g. Responses Yi were obtained as: 

Model (i): Y% = cos(J 1 (3Xi) + £3 (nonmonotone link); 

Model (ii): Yi = (J^ f3Xi) 2 + £,• (nonmonotone link); 

Model (iii): Yi = J /3Xi + e% (functional linear model; trivially, a mono- 
tone link); 

Model (iv): Y{ ~ Poisson{exp(2 + f Q /3Xi)} (functional generalized Pois- 
son model; a monotone link with heteroscedastic noise); 

Model (v): Yi ~ Binomial(l, ^cos(2 f j3Xi) + |) (functional generalized 
Binomial model; a nonmonotone link with heteroscedastic noise); 

where j3 = -^4>i + ^73^2 + ~^4>3 + ^7q^4 m au m odels. In models (i)-(iii), 
errors £j were simulated as i.i.d. Gaussian noise with mean and var(e) = 
Rv&r{g(f (3X)}. Here R is a measure of the signal-to-noise ratio, with values 
chosen as R = 0.1 and R = 0.5. 

We compared the proposed model with the generalized functional linear 
regression model (GFLM) with unknown link and variance function [17], 
which is a single-index model. In the simulations, we implemented the pro- 
posed model using the local-constant method defined in (2.4) (details can be 
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Table 1 

Simulation results for single-index models (i)-(m). "FSIR" denotes 
the proposed functional single-index regression and "GFLM" denotes 
the generalized functional linear model [17] 



FSIR GFLM 



Model 


N 


R = 


0.1 


R = 


0.5 


R = 


0.1 


R = 


0.5 


RASE 


RSE 


RASE 


RSE 


RASE 


RSE 


RASE 


RSE 


0) 


50 


0.0464 


0.0204 


0.1096 


0.1225 


0.1299 


0.1488 


0.1546 


0.2335 




200 


0.0279 


0.0052 


0.0557 


0.0195 


0.0442 


0.0109 


0.0709 


0.0818 




800 


0.0156 


0.0024 


0.0315 


0.0041 


0.0288 


0.0025 


0.0402 


0.0049 


(") 


50 


0.1334 


0.0304 


0.3071 


0.2240 


0.1914 


0.1423 


0.3329 


0.3407 




200 


0.0731 


0.0065 


0.1549 


0.0223 


0.1058 


0.0150 


0.1838 


0.0840 




800 


0.0399 


0.0025 


0.0844 


0.0047 


0.0702 


0.0028 


0.0970 


0.0053 


(iii) 


50 


0.0970 


0.0341 


0.2562 


0.1705 


0.1024 


0.0546 


0.2378 


0.1819 




200 


0.0486 


0.0078 


0.1122 


0.0332 


0.0463 


0.0068 


0.1030 


0.0204 




800 


0.0226 


0.0030 


0.0526 


0.0083 


0.0237 


0.0026 


0.0558 


0.0071 



found in Section 4.1). Prediction outcomes were quantified by root average 
squared errors RASE = {jj Ylii^i~ 9(1 /^Aj)} 2 } 1 / 2 , where Y\ is our estimate 
of g(J (3Xi) denned in the paragraph containing (2.5), plugging in (3 and al- 
ways leaving Y{ out of the sample when calculating Y^. We also quantified 
the error of the estimated regression parameter function by root squared 
error RSE(/3) = {J0 - p) 2 } 1 ' 2 . Average values of RASE and RSE obtained 
from 100 Monte Carlo runs were then used to evaluate the procedures. 

The results in Tables 1 and 2 indicate that the proposed method works 
clearly better than GFLM for models (i), (ii) and (v), where the link function 
is nonmonotone. For model (iii), the performance of the two methods was 
found to be similar. In this example, the effect of the monotone link func- 

Table 2 

Simulation results for single-index models (hr) and (v) 



FSIR GFLM 



Model 


N 


RASE 


RSE 


RASE 


RSE 


(iv) 


50 


1.798 


0.0767 


1.632 


0.0639 




200 


1.207 


0.0214 


1.064 


0.0137 




800 


0.8117 


0.0071 


0.6880 


0.0045 


(v) 


50 


0.2324 


0.4023 


0.2060 


0.4333 




200 


0.1222 


0.0850 


0.1400 


0.2866 




800 


0.0612 


0.0140 


0.0629 


0.0728 
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tion (here it is linear) would have been expected to favor the GFLM, but 
this may be counteracted by the fact that the GFLM fits an unnecessarily 
complex model in the case of homogeneous errors, as it also includes a non- 
parametric variance function estimation step. In model (iv), where the link 
is monotone and the noise is heteroscedastic, the GFLM not unexpectedly 
performs better, as it is able to target the heteroscedastic errors, improving 
efficiency of the estimates. Overall, it emerges that the proposed method is 
clearly preferable in situations where the link function is nonmonotone. 

5.2. Simulations for multiple index models. We simulated data for five 
multiple index models, using the same processes and settings as described 
in Section 5.1. Three of the models [(vi)-(viii)] contain two indices and two 
models [(ix)-(x)] contain three indices, as follows: 

Model (vi): Y{ = cos(J 1 ft^Q) + 0.5sin(J 1 ftX^) + e\ (two nonmonotone 
link functions), where ft = A=(f>i + -^0 2 + 4g0 3 + 77^4, and ft = 4g=0i - 

73^-76^ + 76^ 

Model (vii): Yi = f Q ftX, + exp(0.5 J Q ftXj) + £j (two monotone link func- 
tions), where ft = A=<pi + 4^ 2 + 7^3 + 4^4 and ft = ^=</>i - -^4>2 - 

Model (viii): Yj = J Q ftX, + 0.5(j* ftXj) 2 + £j (one nonmonotone link 
and one monotone link), where ft = ttj^i + 77J 4>2 + ~^4>3 + 775^4 and ft = 

^1-^2-^3 + 776^4; 

Model (ix): Y t = ft ftX, + exp(0.5 ft*;) + 0.5( ftX 4 ) 2 + e { (three 
link functions), where ft = 7/f^l + 7/f02 + -^^3 + 7/g04, ft = 775 <f>l ~ 775^2- 

7^ 3 + Te^ 4 and & = "T^ 1 + 73^ 2 + T^ 3 + Te^ 

Model (x): Y { = ft ftJ*Q + 0.5(/ 1 ftX;) 2 + 0.25( /J ftJQ) 3 + e { (three link 

functions), where ft = 4=0i + 4=0 2 + 77^3 + 775 04, ft = 775^1 - 775^2 - 
7^3 + 7^4 and ft = -7^1 + 77^2 + ^03 + 

We compared the results from the recursive fitting procedure and the 
backfitting procedure in terms of root average squared errors 

{( k V ^ ^ ^ 1/2 

^E|X>(/ -X>(/ ^) 

for a p-index model when fitting the first k indices. It is of interest to include 
cases k < p (not fitting a sufficient number of indices) and k> p (overfitting 
the number of indices) and to determine whether the best results are ob- 
tained for the correct number of indices, which would suggest choosing the 
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Table 3 

Simulation results for multiple index model (vi)-(vm) with two underlying indices. 
RASE^, k — 1,2,3, stands for root average errors using the recursive fitting procedure 
and k indexes and RASEi for the same errors obtained when using the iterative 
backfitting procedure. Shown are average results based on 100 Monte Carlo runs 



Model 


N 


R 


RASEf 


RASEf 


rase! 1 


RASE{ 


RASE^ 


rase| 


(vi) 


50 


0.1 


0.2975 


0.1960 


0.1872 


0.2975 


0.1209 


0.1483 




200 


0.1 


0.3003 


0.1357 


0.1059 


0.3003 


0.0645 


0.0854 




50 


0.5 


0.3206 


0.2683 


0.2797 


0.3206 


0.2048 


0.2810 




200 


0.5 


0.3051 


0.1778 


0.1754 


0.3051 


0.1235 


0.1427 


(vii) 


50 


0.1 


0.2107 


0.2076 


0.2144 


0.2107 


0.1991 


0.2312 




200 


0.1 


0.1311 


0.1131 


0.1372 


0.1311 


0.1070 


0.1247 




50 


0.5 


0.4238 


0.3937 


0.4592 


0.4238 


0.3786 


0.4335 




200 


0.5 


0.2507 


0.2329 


0.2719 


0.2507 


0.2267 


0.2848 


(viii) 


50 


0.1 


0.4463 


0.3184 


0.3317 


0.4463 


0.2310 


0.2817 




200 


0.1 


0.4061 


0.1496 


0.1594 


0.4061 


0.1016 


0.1279 




50 


0.5 


0.4818 


0.4872 


0.5327 


0.4818 


0.4632 


0.4698 




200 


0.5 


0.4304 


0.2502 


0.2910 


0.4304 


0.2107 


0.2412 



number of indices by fitting various numbers of indices and choosing the 
number according to the model with the best fit. Here gj and f3j are esti- 
mated using both recursive and backfitting procedures. Accordingly, if the 
underlying model, selected from models (vi)-(x), contains p indices, we cal- 
culated the values for RASEfc for k = l,...,p + l. 

As one can see from the results in Tables 3, 4 and 5, the recursive fitting 
procedure often does not identify the right number of indexes and for nearly 
all fits produces larger RASE values, as compared to the iterative backfitting 
procedure. The iterative backfitting method thus emerges as the preferred 
method. 



Table 4 

Recursive fitting results for models (be) and (x) with three indices 



Model 


N 


R 


RASEf 


RASEf 


RASEf 


RASEf 


(ix) 


50 


0.1 


0.5107 


0.3417 


0.3518 


0.3772 




200 


0.1 


0.4791 


0.2196 


0.2132 


0.2183 




50 


0.5 


0.5453 


0.4810 


0.5104 


0.5297 




200 


0.5 


0.5161 


0.3324 


0.3329 


0.3504 


(x) 


50 


0.1 


0.5107 


0.3417 


0.3518 


0.3772 




200 


0.1 


0.4792 


0.2327 


0.2264 


0.2316 




50 


0.5 


0.6631 


0.6461 


0.6111 


0.6418 




200 


0.5 


0.5161 


0.3224 


0.3329 


0.3504 
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Table 5 

Iterative backfitting results for models (ix) and (x) with three indices 



Model 


N 


R 


RASE{ 


RASE2 


rase| 


RASE4 


(ix) 


50 


0.1 


0.5107 


0.3272 


0.3009 


0.3395 




200 


0.1 


0.4791 


0.2084 


0.1422 


0.1988 




50 


0.5 


0.5453 


0.5372 


0.5518 


0.6095 




200 


0.5 


0.5161 


0.3350 


0.3137 


0.3980 


(x) 


50 


0.1 


0.5107 


0.3501 


0.3106 


0.3495 


200 


0.1 


0.4792 


0.2063 


0.1808 


0.1860 




50 


()..-) 


0.6631 


0.6291 


0.5825 


0.6476 




200 


0.5 


0.5161 


0.3372 


0.3129 


0.3248 



APPENDIX: PROOF OF THEOREM 3.1 
We describe the details of the proof by breaking it up into several steps. 

Step 1. Upper bound on mean summed squared error. Define jj = 
g(X J )=g 1 (f x (3°X j ) and 

(A.l) 7, = ( E / E K ^ ^ = ( E / E 

To express their dependence on /3, through Kij = Kij(/3), we shall write jj, 
£j and Yj as y(/3), £j({$) and Yj(f3), respectively. In this notation, S((3) = 
S + Si(/3) + 5 2 (/3) + 25 3 (/3), where S = Yli<j<n £2 j an ^ does not depend 
on /3, 

n n 

sm = E^- - W)} 2 , = E^ - 
3=1 

Furthermore, 5i(/3) = 5 4 (/3) - 25 5 (/3) + 5 6 (/3), where 
n n 

S 4 (/3) = Ei7i - 7iG9)} 2 , 5 5 (/3) = Ei7i - 
3=1 3=1 

(A.3) 

56(/3)=E e " J (/ 3 ) 2 ' 
with notations as in (A.l). 
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Let B\ =B\{n) denote a class of functions /3, and suppose we can prove 
that 

(A.4) sup \S k {p)\ =O p (X n ) for k = 2,3,5,6, 

where A n denotes a sequence of positive constants. Then, 
Sx0) = S0) - {So + S 2 0) + 2S 3 0)} 
<S(P°)-{S O + S 2 0) + 2S 3 0)} 
= St(P°) + 5 2 (/3°) + 25 3 (/3°) - {S 2 0) + 2S 3 0)} 

(A.5) 

= S 4 (f3°)-2S^ ) + S 6 (f3°) + S 2 (f3 ) 

+ 2S 3 (f3°)-{S 2 0) + 2S 3 0)} 

= S 4 (f3°) + O p (\ n ), 

where the inequality follows from the fact that (3 = (3 minimizes S(f3), the 
final identity follows from (A.4) provided that (3° and (3 are both in B\(n), 
and all other identities in this string hold true generally. 

Without loss of generality, the support of K is contained in the interval 
[— 1, 1] [see (3.7)]. If in addition \gi{u) — g\{v)\ <D\\u — v\ ai for all u and v 
[see (3.2)], then \jj — jj(/3°)\ < Dih ai for all j, and therefore 

(A.6) S 4 (/3°)<n( J D 1 / l ai ) 2 . 

Together, (A.5) and (A.6) imply that 

n 

(A.7) " 9^)? = P (K + nh 2 ^). 

3=1 

Step 2. Decomposition of each set Sk((3) into two parts. Let X = {X\ , . . . , 

X n } denote the set of explanatory variables, and for each /3 £ B\ let J = 
>J(P) ^ = {lj • • • > n } denote a random set which satisfies 

(A.8) P[#{ J° \ J (/3)} > 2D 5 nh a4 for some j3 G B\] — >■ 

as n — > oo, where 04 is as in (3.6). (The set J will be A'-measurable.) Define 
5f (/3), for 2 < jb < 6, to be the version of Sk(/3) that arises if, in the defini- 
tions at (2.1) and (2.2), we replace summation over 1 < j < n by summation 
over j G J . Since 5 is bounded, and all moments of the error variables £j 
are finite [see (3.3)], then sup 1<i<n |li| = O p (n 11 ) with probability 1, for all 
T] > 0. Therefore, in view of (A.8), 

(A. 9) max sup \Sk(f3) ~ Si (/3)\ = O p (n 1+ ^h ai ) for all r, > 0. 

fc=l,...,6 /3eBl 
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Step 3. Determining J for which (A.8) holds. Define Tj (/3) = ^ . i K^ , 

recall that /(• | f3) denotes the probability density of J x f3X, and put 

aj (p) = h j K{u)f(J pXj -hu\p) du. 

Then, 

E{T j (P)\X j } = a j (/3), 
var{T.,(/3) | Xj} < (n - l)h J K(u) 2 f(^J (3Xj -hu\p\ du 
< n(sup K)aj(f3). 

Moreover, < Kij < supK. Therefore by Bernstein's inequality, if < c\ < 1, 
P{Tj{f3)<{l- Cl )naj{P)\Xj} 

= P{n aj (f3) - T 3 {P) > c inaj ((3) | Xj} 

(A.10) 

{ Cl naj{f3)} 2 /2 



< exp 



■ exp 



(sup K){naj(/3) + (l/3) Cl naj(/3)} 
c\naj(f3) 



2(supK)(l + (l/3)ci) 



Hence, defining J(fi) to be the set of all integers j such that ctj(/3) >n C2 h, 
where < C2 < 1; and putting C2 = cf/{2(sup K)(l + |ci)}; we obtain 

sup P{Tj(P) < (1 - c x )naj{P) \ Xj] < e^p{-C 2 n 1 ~ C2 h). 

Therefore, since J(P) contains no more than n elements, then 

P{Tj(l3) < (1 - a)naj(P) for some j E J(fi) and some p e £1} 
<n(#5i)exp(-C 2 n 1 - C2 /i). 
Hence, provided 

(A.11) #B X = 0{n- Ci - 1 exp(C 2 n 1 - C2 /i)} 

for some C3 > 0, we have 

(A.12) P{Tj{p) > (1 - ci)naj{p) for all j G J(/3) and all G B{\ -> 1. 

Note, too, that if 03 and 04 are as in (3.6), if K is supported on [—1,1], 
and if 

(A.13) (supiTr 1 ™" 02 < Z^ 3 , 
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i=i 



lf(it)/(jf 0Xj -hu\p)du< 



11 



-C2 



3=1 



where 



Ij = =l\ sup /( f (3Xj -u\p)< D 4 h as \. 

[\u\<h \Jx J J 



>l 

The random variables 1%, . . . , I n are independent and identically distributed, 
and, in view of (3.6), vr(/3) = P{Ij(/3) = 1} < D 5 h ai . Therefore, by Bern- 
stein's inequality, 



p\^y j {P)>2D b h ai ^<p 



< 



cxp 



3=1 

(D 5 h a *) 2 /2 



Hence, provided 
(A.14) 

result (A.8) holds. 



mr(P){l-ir(j3)} + (l/3)D 5 h a * 
<exp(-3 J D 5 n/i a4 /8). 

#B 1 = o{exp(3D 6 nti«/8)}, 



Step 4. Bound for E X {S% (/3) 2m } for k = 2, 3, 5, 6 and integers m > 1. 

Write Ex for expectation conditional on X, let Q = Q((3) denote the in- 
fimum of Yli- i^j Kij over au 3 ^ >?> an d P u ^ a2 = E(e 2 ). Defining = 
K-ij/^Zii ■ h¥=j wj)' taking m > 1 to be an integer, and using Rosenthal's 
inequality, we deduce that for a constant j4(m) depending only on m, 

^(ef) < A{m)la 2m (Y. L 2 X + E(e 2m ) £ L*» ) 
(A.15) 



J2r~\—1 „ TS\'m i cv_2m\ .'-j— (2m— 1) 



< A{m){{a z Q- L supK) m + E{e Zm )Q~ 



(supK) 2 -" 1 }. 
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Therefore, 

E x {Si(P) 2m } 



2m 



(A.16) < I {Ex\ej\ 2m ) ll{2m) \ 

< A{m)n 2m {{a 2 Q~ l svvK) m + E{e 2m )Q- {2m ~ l Xsu V K) 2m - 1 }. 

Moreover, if \g\ < d, then Sf (/3) < n(2Ci) 2 , and so, since Sf {/3) 2 < S{ (/3) x 
Si (13), then 

ExiSitf) 2 " 1 } < S{(f3) m {E x s£(P) 2m } 1 ' 2 

(A.17) 

<{n(2C 1 ) 2 } m {^^(/3) 2m } 1 / 2 . 

More simply, if \g\ < C u then Sf (/3) < n(2Ci) 2 and £\ | 7j - %(/3)| 2m < 
n(2Ci) 2m , both uniformly in /J. Therefore, 

^{5f(/3) 2m }<^(m)[{a 2 Sf(/3)} m + J B(e 2m ) £ | 7j - 7i (/3)| 2m 
(A.18) 

< 4(m)(2Ci) 2m {(n<7 2 ) m + ?i£(e 2m )}. 

Recall that the support of K is contained in the interval [—1,1]. Let 
N\ denote the maximum, over values j E 3 , of the number of indices k 
such that \f x [3(Xj — Xk)\ < h. Then, the series Yli-.i^j^-ij nas ) f° r each 
j, at most 7V"x nonzero terms. Array the values of f x 0Xj, for j € JT, on 
the real line, and group them into consecutive blocks of indices j, each 
block (except for the last remnant block) containing just N\ values. Index 
these blocks, from left to right along the line, from 1 to N2, where N2 
equals l(#3)/Ni\ or [(#3)/Ni\ + 1 and \x\ denotes the integer part of x. 
Choose one point f x (3Xj from each even-indexed block, and remove those 
points from the respective blocks; and repeat this until all the points are 
removed from all the blocks. Record, for each pass through the N2 blocks, 
the removed sequence ji, ■ ■ ■ ,j u of indices. (On the first pass, v will equal 
[A^2/2j or LA2/2J + 1, but on later passes, v may be reduced in size.) Now 
repeat this for odd-indexed blocks. Denote by jki, ■ ■ ■ ,jkM k , for 1 < k < N 
say, the different sequences j± , . . . , j v that are obtained in this way. The 
set of all such sequences represents a (disjoint) partition of the integers in 
3 , and in particular, Mi + • • • + = n. By construction, for each k the 
random variables £j kl £j kl , • • • , £j kMk ^jkM k are independent, conditional on X; 
the random integers N and M\ , . . . , Mjv are measurable in the sigma-field 
generated by X; N < 2JVi; and max fc M fc < [(#J)/(2N 1 )\ + 1. 
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Since 

N 

£ j^j = y^( £ jfci^?'fci + " ■ + £ jkM k £ jkM k )i 

j£j(/3) k=l 

then, for any integer m > 1 and an absolute constant ^4(m) > 1, depending 
only on m, 



E x {Si^) 2 '"} 



2m 



/AT \ 2m 



\k=l 



( N 

< Mm) ]T 

\fe=i 

< A(m)iV 2m 



Mi m Mi 



l/(2m)\ 2m 



x max fcr 2m |M fc max ^(et,))™ + Ee 2m M k max £*(|£ 7 -,,| 2m ) 
Therefore, by (A. 15), 

(A.19) < A{m) 2 N 2m ^{a 2 Q^snpK) m ma^M^ 1 

+ J B(e 2m )g- (2m - 1) (sup^) 2m - 1 max M k \. 

l<k<N J 

The constant A(m) in these bounds can be taken equal to (^4m/logm) m , 
where A > 1 denotes an absolute constant [13, 16]. From this property, and 
results (A.16), (A.17), (A.18) and (A.19), and recalling that N < 2N ± and 
Mk < (n/2N\) + 1, we deduce that for a constant C4 > 1, 

fc=2,3,5,6 

(A.20) < (?n/logm) m/2 (C 4 n) 2m {Q- m / 2 + E(e 2m )Q {1/2) - m } 

+ (m/logm^CTiinNi/Qr + Eie^Mm/Q) 2 " 1 - 1 }. 

The contributions to the left-hand side from and S§ dominate, and so 
the right-hand side represents, in effect, E X {S( (/3) 2m } + E X {S$ (/3) 2m }. 
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Step 5. Upper bounds for N\ and Q~ 1 . Let Tj (ft) denote the version 
of Tj(/3) in the special case where K = 1 on [—1,1] and K = elsewhere, and 
write a^\j3) = hj,,^ f(f x (3Xj — hu\/3) du, representing the corresponding 
version of aAj3). In this notation, N\ = N\{f3) equals the maximum, over j, 

of the values of T^\f3) for j G jT(/3). The argument leading to (A. 10) now 
gives 

P{Tf\f3)>{l + c 1 )naf{f3)\X j } 



P{T] 1] (/3) - naf{P) > Cl naf(f3) \ X,} 



< 



exp 



{ Cl naf\(3)} 2 /2 



naf\f3) + (l/3)c in af(P) 



■ exp 



cW 1] (/3) 



2(l + (l/3)ci 

The analogue of (A. 12) in this setting is, assuming that (A. 11) holds: 
(A.21) P{Tf\p) < (1 + ci)naf (/3) for all j and all /3 G B\} — > 1. 

Since (/3) < /isup/(- | /?), then, using (3.5), we deduce from (A.21) that 
for a constant C5 > 0, 

(A.22) P{Ni{P) < C 5 nh for all j3 G — > 1. 

Observe, too, that 

(A.23) 

^(l-d)- 1 ^ 2 - 1 /!- 1 , 

where the first identity is just the definition of Q; the second, in view of 
(A. 12), holds uniformly in f3 G23i, with probability converging to 1 as n—> 
00; and the third is a consequence of the definition of J{f3) as the set of j 
for which <x/(/3) > n~ C2 h. 

Step 6. Proof of uniform convergence to zero of n^Sf (/3) for k = 
2,3,5,6. Incorporating the bounds at (A.21) and (A.22) into (A. 20), and 
taking m to diverge polynomially fast in n, we deduce that, for constants 
Cq, C7 > 1, and with probability converging to 1 as n— > 00, 

S (m,n)=sup E x {Si(P) 2m ] 
/3eBl fc=2,3,5,6 

< (m/logm) m / 2 (C 6 n) 2m {(n C2 -V/i) m/2 +^(e 2m )K 2 ~V^) m " (1/2) } 
(A.24) + (m/logm) 2m Q m {n m(c2+1) +S(e 2m )n (2m - 1)c2+1 } 
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< (C 7 n) 2m {(mn C2 -7/i) m/2 + (m 2aa n^" 1 / 7i) m } 

+ (CVn 2 ) m {(roV* _1 ) m + {m 2 ^ +1 ^n 2c2 ~ 2 ) m }, 

where, to obtain the last inequality, we used the bound -E|e| m < (D2m) a2Tn 
in (3.3). 

Choose CI-, an d further positive constants C%, Cg, C3, C4, C5, such that 
(A. 25) C2 + 2c3max(l,a2) + C5 < 1 and 0<C4<cs. 

Take m equal to the integer part of n C3 and 
(A.26) C 8 n~ C5 < h< C 9 n~ C4 . 

The constant C2 S (0, 1) was introduced immediately below (A. 10), and, up 
to (A. 25), was subject only to the conditions (A. 13) and < ci < 1. For any 
given 03 and C2, no matter how small the latter, we can ensure that (A. 13) 
holds merely by taking C5 (and thence C4), in (A.26), sufficiently small. Since 
the results below continue to hold no matter how small we choose C5 (and 
C4), then we can be sure that (A. 13) is satisfied. 

It follows from (A.24)-(A.26) that, with probability converging to 1 as 
n — > 00, 

s(m,n) < {C 7 nf m {{n C2+C3+c *- l ) m + ( n «+2a 1C 3+c 5 -l)m 

_|_ ( n C2+2c 3 -l^m. _j_ j- n 2{c 2 +c 3 (a2+l)+C2-l}^m j 

< 4(C 7 n C6+1 ) m , 

where c 6 = max{c2 + C3 + C5,C2 + 2aiC3 + C5,C2 + 2c3,C2 + c 3 (a2 + l) + C2} < 1. 
Therefore, if < c 7 < 1 — cq and we put c$ = (1 — cq — c 7 )/2 > 0, then, by 
Markov's inequality, 



Pin" 1 sup V |Sf (/3)| >n~ C8 \ x\ < 16 m (#fii)s(m, n)n" 



2m(l-c 8 ) 



(A ' 27) <4(#Bi)(16C 7 n" C7 ) m , 

where the inequalities hold with probability converging to 1 as n — > 00. 
Hence, provided that 

(A.28) (#fii)(16C 7 n- C7 ) m ^0, 

the left-hand side of (A. 27) converges in probability to zero as 00. It 
follows that the unconditional form of that probability also converges to 
zero, and hence that 

(A.29) Pin- 1 sup max \S? (B)\ > n- C8 \ 0. 

I £1=2,3,5,6 fc J 



FUNCTIONAL REGRESSION WITH NONPARAMETRIC LINK 27 



Step 7. Completion. Prom (A. 9) and (A. 29) we deduce that, for all r\ > 0, 
(A.30) n" 1 sup max \S k (B)\ = OJn r 'h ai + n" C8 ) = OJn^ + n~ Cs ), 

p eBl fc=2,3,5,6 

where we used (A. 26) to derive the final identity. Therefore, (A. 4) holds with 
A ra = n 1_C9 for any eg E (0, max(c4, cs)). Hence we may use this value of X n 
in (A. 7), establishing that 

n 

(A.31) n" 1 5>(^) - 9(X 3 )} 2 = O p (n~ c ), 

with c = min(cg, 2a\c^), where the estimator f3 used to define g is obtained 
by minimizing S(f3) = ^2j(Yj — Yj) 2 [the first quantity in (2.7)] over /3 E B\. 
[We used (A. 26) to simplify the term in h 2ai in (A. 7).] 

During the proof above we imposed on the class B\ the assumption that 
/3° E B\ [see the discussion following (A. 5)], and also three conditions — 
(A. 11), (A. 14) and (A. 28) — on the size of the class. The latter three condi- 
tions hold if 

(A.32) #Bi = 0{exp(n Cl0 )}, 

provided < c\o < min(l — C2 — C5, 1 — CI4C5, C3). (Recall from Step 6 that m 
equals the integer part of n C3 .) By choosing C5 smaller if necessary we can 
ensure that the upper bound here is strictly positive, and so cio > 0. 

Let < en < cio and C12 > 0, define r = r(n) to be the integer part of n cn , 
and let D3 be as in (3.4). Let r be as stipulated in (3.8), and write B2 for the 
class of functions /3 = Si<fc<r^feV ; fc such that each \bk\ < -D3 for 1 < k < r. 
Let B3 be the set of elements of B2 for which each for 1 < k < r, is an 
integer multiple of n _Cl2 . The number of elements of B3 is bounded above 
by a constant multiple of 

(A.33) (2D 3 n Cl2 ) r < exp(const. n Cl1 logn) = o{exp(n Cl0 )}. 

Put Bi = B 3 U {/3 }. Then (A.32) follows from (A.33). 

The following three properties hold: (a) The lattice on which B3 is based 
can be made arbitrarily fine in a polynomial sense, by choosing cyi suffi- 
ciently large; (b) £J||X|| r? < 00 for some r\ > [see (3.3)]; and (c) K has a 
bounded derivative [see (3.7)]. Given (3 = £i<fc<r G B 2 , let P approx = 
Si<fc< r ^fc PPr ° X ^fc be the element of £>2 defined by taking ^P piox to be the 
lattice value nearest to b^, for 1 < k < r. Define S'(n) to equal the maximum, 
over l<i,j<n, of — Property (b) implies that 5 n = O p (n Cl3 ) for 
some C13 > 0. Using this property, and (a) and (c), it can be proved, by 
taking c\i sufficiently large, that for any given C14 > 0, 

sup max |^(/3)-^(/3 a PP rox )| 
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(A.34) 



O. 



v 




/3GB 2 



iapprox 



} 



O p (n 



From this result and the other properties of K in (3.7) it can be shown 
that (A. 31) continues to hold if (3 in the definition of g is replaced by the 
minimizer of S((3) = Y.j( Y j ~ Yj? over P^B 4 = B 2 U {(3°}. 
Call this result (R). 

The desired result (3.9) follows from (R), except that the set B4 contains 
(3° as an unusual, adjoined element. Hence there is, in theory, a possibility 
that /3 = /3°; this could not happen if we were to restrict f3 to elements 
of £>2i as required when defining the estimator g in (3.9). To appreciate 
that this does not cause any difficulty, let (3 1 = ^i<fc< r ^fc^ denote the 
approximation to (3° obtained by dropping all but the first r terms in the 
expansion (3° = ^fc>i^fc^fc- The argument leading to (A.34) can be used 
to prove that, for C14 > chosen arbitrarily large, there exists a value of 
B = 13(014), hi the second part of (3.4), such that 



Arguing as before, this leads to the conclusion that (3 can be dropped from 
B4 without damaging result (R). 
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